# K-ATHENA: a performance portable structured grid finite volume magnetohydrodynamics code

Philipp Grete, Forrest W. Glines, and Brian W. O'Shea

Abstract—Large scale simulations are a key pillar of modern research and require ever-increasing computational resources. Different novel manycore architectures have emerged in recent years on the way towards the exascale era. Performance portability is required to prevent repeated non-trivial refactoring of a code for different architectures. We combine ATHENA++, an existing magnetohydrodynamics (MHD) CPU code, with KOKKOS, a performance portable on-node parallel programming paradigm, into K-ATHENA to allow efficient simulations on multiple architectures using a single codebase. We present profiling and scaling results for different platforms including Intel Skylake CPUs, Intel Xeon Phis, and NVIDIA GPUs. K-ATHENA achieves  $> 10^8$  cell-updates/s on a single V100 GPU for second-order double precision MHD calculations, and a speedup of 30 on up to 24,576 GPUs on Summit (compared to 172,032 CPU cores), reaching  $1.94 \times 10^{12}$  total cell-updates/s at 76% parallel efficiency. Using a roofline analysis we demonstrate that the overall performance is currently limited by DRAM bandwidth and calculate a performance portability metric of 62.8%. Finally, we present the implementation strategies used and the challenges encountered in maximizing performance. This will provide other research groups with a straightforward approach to prepare their own codes for the exascale era. K-ATHENA is available at https://gitlab.com/pgrete/kathena.

Index Terms—D.2.8.b Performance measures, D.3.2.d Concurrent, distributed, and parallel languages, I.6.8.h Parallel, J.2.i Physics, J.2.c Astronomy

## 1 Introduction

THE era of exascale computing is approaching. Different I projects around the globe are working on the first exascale supercomputers, i.e., supercomputers capable of conducting  $10^{18}$  floating point operations per second. This includes, for example, the Exascale Computing Initiative working with Intel and Cray on Aurora as the first exascale computer in the US in 2021, the EuroHPC collaboration working on building two exascale systems in Europe by 2022/2023, Fujitsu and RIKEN in Japan working on the Post-K machine to launch in 2021/2022, and China who target 2020 for their first exascale machine. While the exact architectural details of these machines are not announced yet and/or are still under active development, the overall trend in recent years has been manycore architectures. Here, manycore refers to an increasing number of (potentially simpler) cores on a single compute node and includes CPUs (e.g., Intel's Xeon Scalable Processor family or AMD's Epyc family), accelerators (e.g., the now discontinued Intel Xeon Phi line), and GPUs for general purpose computing. MPI+OPENMP has been the prevailing parallel programming paradigm in many areas of high performance computing for roughly two decades. It is questionable, however, whether this generic approach will be capable of making efficient use of available hardware features such as parallel threads and vectorization across different manycore architectures and between nodes.

In addition to extensions of the MPI standard such as shared-memory parallelism, several approaches in addition to MPI+OPENMP exist and are being actively developed to address either on-node, inter-node, or both types of parallelism. These include, for example, partitioned global address space (PGAS) programming models such as UPC++ [1], or parallel programming frameworks such as CHARM++ or LEGION, which are based on message-driven migratable objects [2], [3].

Our main goal is a performance portable version of the existing MPI+OPENMP finite volume (general relativity) magnetohydrodynamics (MHD) code ATHENA++ [4], [5]. This goal includes enabling GPU-accelerated simulations while maintaining CPU performance using a single code base. More generally, performance portability refers to achieving consistent levels of performance across heterogeneous platforms using as little architecture-dependent code as possible. Given the uncertainties in future architectures (and the broad availability of different architecture already today) performance portability is an active field of research in many areas [6], [7]. This includes (but is not limited to) idealized benchmarks and miniapps [8], [9], [10], [11], algorithm libraries [12], structured mesh codes [13], or particle in cell codes [14].

In order to keep the code changes minimal, and given the MPI+OPENMP basis of ATHENA++, we decided to keep MPI for inter-node parallelism and focus on on-node performance portability. For on-node performance portability several libraries and programming language extensions exist. With version 4.5 OPENMP [15] has been extended to support offloading to devices such as GPUs, but support and maturity is still highly compiler and architecture dependent. This similarly applies to OPENACC, which has

P. Grete, F. W. Glines, and B. W. O'Shea are with the Department of Physics and Astronomy, Michigan State University, East Lansing, MI 48824 USA, also with the Department of Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, MI 48824 USA.

<sup>E-mail: grete@pa.msu.edu
B. W. O'Shea is also with the National Superconducting Cyclotron</sup> Laboratory, Michigan State University, East Lansing, MI 48824 USA.

been designed from the beginning to target heterogeneous platforms. While these two directives-based programming models are generally less intrusive with respect to the code base, they only expose a limited fraction of various platform-specific features. OPENCL [16] is much more flexible and allows fine grained control over hardware features (e.g., threads), but this, on the other hand, adds substantial complexity to the code. KOKKOS [17] and RAJA [18] try to combine the strength of flexibility with ease of use by providing abstractions in the form of C++ templates. Both KOKKOS and RAJA focus on abstractions of parallel regions in the code, and KOKKOS additionally provides abstractions of the memory hierarchy. At compile time the templates are translated to different (native) backends, e.g., OPENMP on CPUs or CUDA on NVIDIA GPUs. A more detailed description of these different approaches including benchmarking in more idealized setups can be found in, e.g., [9],

We chose KOKKOS for the refactoring of ATHENA++ for several reasons. KOKKOS offers the highest level of abstraction without forcing the developer to use it by setting reasonable implicit platform defaults. Moreover, the KOKKOS core developer team actively works on integrating the programming model into the C++ standard. New, upcoming features, e.g., in OPENMP, will replace manual implementations in the KOKKOS OPENMP backend over time. KOKKOS is already used in several large projects to achieve performance portability, e.g., the scientific software building block collection Trilinos [19] or the computational framework for simulating chemical and physical reactions Uintah [20]. In addition, KOKKOS is part of the DOE's Exascale Computing Project and we thus expect a backend for Aurora's new Intel Xe architecture when the system launches. Finally, the KOKKOS community, including core developers and users, is very active and supportive with respect to handling issues, questions and offering workshops.

The resulting K-Athena code successfully achieves performance portability across CPUs (Intel, AMD, and IBM), Intel Xeon Phis, and NVIDIA GPUs. We demonstrate weak scaling at 76% parallel efficiency on 24,576 GPUs on OLCF's Summit, reaching  $1.94 \times 10^{12}$  total cell-updates/s for a double precision MHD calculation. Moreover, we calculate a performance portability metric of 62.8% across Xeon Phis, 6 CPU generations, and 3 GPU generations. We make the code available as an open source project<sup>1</sup>.

The paper is organized as follows. In Section 2 we introduce KOKKOS, ATHENA++, and the changes made and approach chosen in creating K-ATHENA. In Section 3 we present profiling, scaling and roofline analysis results. Finally, we discuss current limitations and future enhancements in Sec. 4 and make concluding remarks in Sec. 5.

# 2 METHOD

## **2.1 K**okkos

KOKKOS is an open source<sup>2</sup> C++ performance portability programming model [17]. It is implemented as a template

- 1. K-ATHENA's project repository is located at https://gitlab.com/pgrete/kathena.
- 2. See https://github.com/kokkos for the library itself, associated tools, tutorial and a wiki.

library and offers abstractions for parallel execution of code and data management. The core of the programming model consists of six abstractions.

First, execution spaces define where code is executed. This includes, for example, OPENMP on CPUs or Intel Xeon Phis, CUDA on NVIDIA GPUs, or ROCm on AMD GPUs (which is currently experimental). Second, execution patterns are parallel patterns, e.g. parallel\_for or parallel\_reduce, are the building blocks of any application that uses KOKKOS. These parallel regions are often also referred to as kernels as they can be dispatched for execution on execution spaces (such as GPUs). Third, execution policies determine how an execution pattern is executed. There exist simple range policies that only specify the indices of the parallel pattern and the order of iteration (i.e., the fastest changing index for multidimensional arrays). More complicated policies, such as team policies, can be used for more fine-grained control over individual threads and nested parallelism. Fourth, memory spaces specify where data is located, e.g., in host/system memory or in device space such as GPU memory. Fifth, the memory layout determines the logical mapping of multidimensional indices to actual memory location, cf., C family row-major order versus Fortran column-major order. Sixth, memory traits can be assigned to data and specify how data is accessed, e.g., atomic access, random access, or streaming access.

These six abstractions offer substantial flexibility in fine-tuning application, but the application developer is not always required to specify all details. In general, architecture-dependent defaults are set at compile time based on the information on devices and architecture provided. For example, if CUDA is defined as the default execution space at compile time, all Kokkos::Views, which are the fundamental multidimensional array structure, will be allocated in GPU memory. Moreover, the memory layout is set to column-major so that consecutive threads in the same warp access consecutive entries in memory.

### 2.2 ATHENA++

ATHENA++ is a radiation general relativistic magnetohydrodynamics (GRMHD) code focusing on astrophysical applications [4], [5]. It is a rewrite in modern C++ of the widely used ATHENA C version [21]. ATHENA++ offers a wide variety of compressible hydro- and magnetohydrodynamics solvers including support for special and relativistic (M)HD, flexible geometries (Cartesian, cylindrical, or spherical), and mixed parallelization with OPENMP and MPI. Apart from the overall feature set, the main reasons we chose ATHENA++ are a) its excellent performance on CPUs and KNLs due to a focus on vectorization in the code design, b) a generally well written and documented code base in modern C++, c) point releases are publicly available that contain many (but not all) features<sup>3</sup>, and d) a flexible task-based execution model that allows for a high degree of modularity.

ATHENA++'s parallelization strategy evolves around socalled meshblocks. The entire simulation grid is divided

3. Our code changes are based on the public version, ATHENA++ 1.1.1, see <a href="https://github.com/PrincetonUniversity/athena-public-version">https://github.com/PrincetonUniversity/athena-public-version</a>

into smaller meshblocks that are distributed among MPI processes and/or OPENMP threads. Each MPI processes (or OPENMP thread) owns one or more meshblocks that can be updated independently after boundary information have been communicated. If hybrid parallelization is used, each MPI process runs one or more OPENMP threads that each are assigned one or more meshblock. This design choice is often referred to as coarse-grained parallelization as threads are used at a block (here meshblock) level and not over loop indices. In general, ATHENA++ uses persistent MPI communication handles in combination with one-sided MPI calls to realize asynchronous communication. Moreover, each thread makes its own MPI calls to exchange boundary information. As a result, using more than one thread per MPI process may increase overall on-node performance due to hyperthreading but also increases both the number of MPI messages sent and the total amount of data sent. The latter may result in overall worse parallel performance and efficiency, as demonstrated in Sec 3.3.2.

Listing 1. Example triple for loop for a typical operation in a finite volume method on a structured mesh such as in a code like ATHENA++, where ks, ke, js, je, is, and ie are loop bounds and u is an athena\_array object of, for example, an MHD variable.

```
for( int k = ks; k < ke; k++) {
  for( int j = js; j < je; j++) {
    #pragma omp simd
    for( int i = is; i < ie; i++) {
        /* Loop Body */
        u(k,j,i) = ...
}}</pre>
```

Given the coarse-grained OPENMP approach over meshblocks the prevalent structures in the code base are triple (or quadruple) nested for loops that iterate over the content of each meshblock (and variables in the quadruple case). A prototypical nested loop is illustrated in Listing 1. Generally, all loops (or kernels) in ATHENA++ have been written so that OPENMP simd pragmas are used for the innermost loop. This helps the compiler in trying to automatically vectorize the loops resulting in a more performant application.

### 2.3 K-ATHENA = KOKKOS + ATHENA++

In order to combine ATHENA++ and KOKKOS, four major changes in the code base were required: 1) making Kokkos::Views the fundamental data structure, 2) converting nested for loop structures to kernels, 3) converting "support" functions, such as the equation of state, to inline functions, and 4) converting communication buffer filling functions into kernels.

First, Views are the KOKKOS' abstraction of multidimensional arrays. Thus, the multidimensional arrays originally used in ATHENA++, e.g., the MHD variables for each meshblock, need to be converted to Views so that these arrays can transparently be allocated in arbitrary memory spaces such as device (e.g., GPU) memory or system memory. ATHENA++ already implemented an abstract athena\_array class for all multidimensional arrays with an interface similar to the interface of a View. Therefore, we only had to add View objects as member

variables and to modify the functions of athena\_arrays to transparently use functions of those member Views. This included using View constructors to allocate memory, using Kokkos::deep\_copy or Kokkos::subview for copy constructors and shallow slices, and creating public member functions to access the Views. The latter is required in order to properly access the data from within compute kernels.

Second, all nested for loop structures (see Listing 1 need to be converted to so-called kernels, i.e., parallel region that can be dispatched for execution by an execution space. As described in Sec. 2.1 multiple execution policies are possible, such as a multidimensional range policy (see Listing 2), a one dimensional policy with manual index mapping (see Listing 3), or a team policy that allows for more fine-grained control and nested parallelism (see Listing 4).

Listing 2. Example for loop using KOKKOS. The loop body is reformulated into a lambda function and passed into Kokkos::parallel\_for to execute on the target architecture. The class Kokkos::MDRangePolicy specifies the loop bounds. The array u is now a Kokkos::View, a KOKKOS building block that allows transparent access to CPU and GPU memory. The loop body, i.e., the majority of the code, remains mostly unchanged.

```
parallel_for( MDRangePolicy<Rank<3>>
        ({ks,js,is},{ke,je,ie}),
    KOKKOS_LAMBDA(int k, int j, int i){
        /* Loop Body */
        u(k,j,i) = ...
});
```

Listing 3. Same as Listing 2 but using a one dimensional Kokkos::RangePolicy (implicit through default template parameter) with explicit index calculation.

```
int nk = ke-ks, nj = je-js, ni = ie-is;
parallel_for(nk*nj*ni,
   KOKKOS_LAMBDA(int idx){
   int k = idx / (nj*ni);
   int j = (idx - k*(nj*ni) / ni;
   int i = idx - k*(nj*ni) - j*ni;
   /* Loop Body */
   u(k,j,i) = ...
});
```

Listing 4. Another approach using KOKKOS' nested team-based parallelism through the <code>Kokkos::TeamThreadRange</code> and <code>Kokkos::ThreadVectorRange</code> classes. This interface is closer to the underlying parallelism used by the backend such as CUDA blocks on GPUs and SIMD vectors on CPUs.

Generally, the loop body remained mostly unchanged. Given that it is not a priori clear what kind of execution policy yields the best performance for a given implementation of an algorithm, we decided to implement a flexible loop macro<sup>4</sup>. That macro allows us to easily change the execution policy for performance tests – see profiling results in Sec. 3.3.1 and discussion in Sec. 4, and this intermediate abstraction is similar to the approach chosen in other projects [13].

Third, all functions that are called within a kernel need to be converted into inline functions (here, more specifically using the KOKKOS\_INLINE\_FUNCTION macro). This is required because if the kernels are executed on a device such as a GPU, the function need to be compiled for the device (e.g., with a \_\_device\_\_ attribute when compiling with CUDA). In ATHENA++, this primarily concerned functions such as the equation of state and coordinate system-related functions

Fourth, ATHENA++ uses persistent communication buffers (and MPI handles) to exchange data between processes. Originally, these buffers resided in the system memory and were filled directly from arrays residing in the system memory. In the case where a device (such as a GPU) is used as the primary execution space and the arrays should remain on the device to reduce data transfers, the buffer filling functions need to be converted too. Thus, we changed all buffers to be Views and converted the buffer filling functions into kernels that can be executed on any execution space. In addition, this allows for CUDA-aware MPI-GPU buffers to be directly copied between the memories of GPUs (both on the same node and on different nodes) without an implicit or explicit copy of the data to system memory.

In general, the first three changes above are required in refactoring any legacy code to make use of KOKKOS. We note that the original ATHENA++ design made it mostly straightforward to implement those changes, e.g., because of the existence of an abstract array class and the prevailing tightly nested loops already optimized for vectorized instructions. More broadly, we expect that structured grid fluid codes will require similar changes and that other algorithms and application may require more subtle refactoring in order to achieve good performance. The fourth change was required more specifically for ATHENA++ due to the existing MPI communication patterns.

Finally, for the purpose of the initial proof-of-concept, we only refactored the parts required for running hydrodynamic and magnetohydrodynamic simulations on static and adaptive Cartesian meshes. Running special and general relativistic simulations on spherical or cylindrical coordinates is currently not supported. However, the changes required to allow for these kind of simulations are straightforward and we encourage and support contributions to re-enable this functionality.

Throughout the development process, we continuously measured the code performance in detail using so-called KOKKOS profiling regions as well as the automated profiling of all KOKKOS kernels. Moreover, we employed automated regression testing using GitLab's continuous integration fea-

4. Note, that in newer versions of the code we replaced the macro with a template.

tures and included specific tests to address changes related to KOKKOS (such as running on different architectures and testing different loop patterns).

## 3 RESULTS

If not noted otherwise, all results in this section have been obtained using a double precision, shock-capturing, unsplit, adiabatic MHD solver consisting of Van Leer integration, piecewise linear reconstruction, Roe Riemann solver, and constrained transport for the integration of the induction equation (see, e.g., [22] for more details). The test problem is a linear fast magnetosonic wave on a static, structured, three-dimensional grid. In GPU runs there is no explicit data transfer between system and GPU memory except during problem initialization, i.e., the exchange of ghost cells is handled either by direct copies between buffers in GPU memory on the same GPU or between buffers in GPU memory on different GPUs using CUDA-aware MPI. Similarly, there is also no implicit data transfer as unified memory was not used. Generally, we used the Intel compilers on Intel platforms, and gcc and nvcc on other platforms as we found that (recent) Intel compilers are more effective in automatic vectorization than (recent) gcc compilers. We used the identical software environment and compiler flags for both K-ATHENA and ATHENA++ where possible. Details are listed in Table 1. We used ATHENA++ version 1.1.1 (commit 4d0e425) and K-ATHENA commit 73fec12d for the scaling tests. Additional information on how to run K-ATHENA on different machines can be found in the code's documentation.

#### 3.1 Profiling

In order to evaluate the effect on performance of the different loop structures presented in Sec. 2.3 we compare the timings of different regions within the main loop of the code. The results using both an NVIDIA V100 GPU and an Intel Skylake CPU for a selection of the computationally most expensive regions are shown in Fig. 1. The 1DRange loop structure refers to a one dimensional range policy over a single index that is explicitly unpacked to the multidimensional indices in the code (cf. Listing 3). While this 1DRange is the fastest loop structure for all regions on the GPU, it is the slowest for all regions on the CPU. According to the compiler report this particular one dimensional mapping prevents automated vectorization optimizations. All other loop structures tested, i.e., simd-for (cf. Listing 1), MDRange (cf. Listing 2), and TeamPolicy (cf. Listing 4) logically separate the nested loops and, thus, make it easier for the compiler to automatically vectorize the innermost loop. This also explains why the results for simd-for, MDRange, and TeamPolicy are very close to each other for all regions except the Riemann solver. The Riemann solver is the most complex kernel in the chosen setup so that the compiler is not automatically vectorizing this loop despite the #pragma ivdep in KOKKOS' MDRange and TeamPolicy. Only the more aggressive explicit #pragma omp simd results in a vectorized loop. The aggregate performance differences (all kernels of a cycle combined) to the fastest simd-for pattern are 0.78 (TeamPolicy), 0.71 (MDRange), and 0.51 (1DRange).

| TABLE 1                                                        |
|----------------------------------------------------------------|
| Software Environment and Compiler Flags Used In Scaling Tests. |

| Machine    | Compiler                 | Compiler flags                                                                                                                      | MPI version            |  |  |  |
|------------|--------------------------|-------------------------------------------------------------------------------------------------------------------------------------|------------------------|--|--|--|
| Summit GPU | GCC 6.4.0 & Cuda 9.2.148 | -03 -std=c++11 -fopenmp -Xcudafe<br>diag_suppress=esa_on_defaulted_function_ignored<br>-expt-extended-lambda -arch=sm_70 -Xcompiler | ted_function_ignored   |  |  |  |
| Summit CPU | GCC 8.1.1                | -O3 -std=c++11 -fopenmp-simd -fwhole-program -flto -ffast-math -fprefetch-loop-arrays -fopenmp -mcpu=power9 -mtune=power9           | Spectrum MPI 10.2.0.11 |  |  |  |
| Titan GPU  | GCC 6.3.0 & Cuda 9.1.85  | -03 -std=c++11 -fopenmp -Xcudafe<br>diag_suppress=esa_on_defaulted_function_ignored<br>-expt-extended-lambda -arch=sm_35 -Xcompiler | Cray MPICH 7.6.3       |  |  |  |
| Titan CPU  | GCC 6.3.0                | -03 -std=c++11 -fopenmp                                                                                                             | Cray MPICH 7.6.3       |  |  |  |
| Theta      | ICC 18.0.0               | -O3 -std=c++11 -ipo -xMIC-AVX512 -inline-forceinline -qopenmp-simd -qopenmp                                                         | Cray MPICH 7.7.3       |  |  |  |
| Electra    | ICC 18.0.3               | -O3 -std=c++11 -ipo -inline-forceinline<br>-qopenmp-simd -qopt-prefetch=4 -qopenmp<br>-xCORE-AVX512                                 | HPE MPT 2.17           |  |  |  |



Fig. 1. Profiling results on a GPU (left) and CPU (right) for selected regions (x-axis) within the main loop of an MHD timestep using the algorithm described in Sec. 3. The different lines correspond to different loop structures, see Sec. 2.3 and the timings are normalized to the fastest Riemann region in each panel.

On the GPU, MDRange is the slowest loop structure, being several times (2x-4x) slower than the 1DRange across all regions. TeamPolicy is on par with 1DRange for half of the regions shown. Here, the aggregate performance differences to the fastest 1DRange pattern are 0.75 (TeamPolicy) and 0.078 (MDRange). As discussed in more detail in Sec. 4, we expected these non-optimized raw loop structures to not cause any major differences in performance.

The results shown here for V100 GPUs and Skylake CPUs equally apply to other GPU generations and other CPUs (and Xeon Phis), respectively. For all tests conducted in the following, we use the loop structure with the highest performance on each architecture, i.e., 1DRange on GPUs and simd-for on CPUs and Xeon Phis.

## 3.2 Performance portability

Our main objective for writing K-ATHENA is an MHD code that runs efficiently on any current supercomputer and

possibly any future machines. A code that runs efficiently on more architectures is said to be performance portable. Determining what is meant by "efficient code" can be vague, especially when comparing performance across different architectures. The memory space sizes, bandwidths, instruction sets, and arrangement of cores on different architectures can all affect how efficiently a code can utilize the hardware.

In order to make fair comparisons of K-ATHENA's performance across different machines (see Sec. 3.2.1), we used the roofline model [23], described in Sec. 3.2.2, to compute on several architectures the architectural efficiency of K-ATHENA, or the fraction of the performance achieved compared to the theoretical performance as limited hardware. We then used the architectural efficiencies to compute the performance portability metric from [24], described in Sec. 3.2.3, to quantify the performance portability of K-ATHENA.

TABLE 2

Technical specifications for devices used in the performance portability metric. Cache size and core counts for CPUs specify the aggregate sizes and counts for a two-socket node while numbers for GPUs show the aggregate for a single device. For the Tesla K80, the cache size and core count is for just one of the two GK210 chips in the GPU. For DRAM bandwidth we use the empirically measured bandwidth of the DRAM on CPUs and the global memory on GPUs. Data for Intel devices comes from [25] and data for NVIDIA devices comes from [26], [27], [28], [29].

| Manufacturer             | Intel           | Intel         | Intel   | Intel     | Intel        | Intel           | Intel              | NVIDIA | NVIDIA | NVIDIA |
|--------------------------|-----------------|---------------|---------|-----------|--------------|-----------------|--------------------|--------|--------|--------|
| Family                   | Xeon E5         | Xeon E5       | Xeon E5 | Xeon E5   | Xeon<br>Gold | Xeon<br>Gold    | Xeon Phi           | Tesla  | Tesla  | Tesla  |
| Microarchitecture        | Sandy<br>Bridge | Ivy<br>Bridge | Haswell | Broadwell | Skylake      | Cascade<br>Lake | Knights<br>Landing | Kepler | Pascal | Volta  |
| Model                    | 2670            | 2680v2        | 2680v3  | 2680v4    | 6148         | 6248            | 7250               | K80    | P100   | V100   |
| Instruction Set          | AVX             | AVX           | AVX2    | AVX2      | AVX512       | AVX512          | AVX512             |        |        |        |
| CUDA Capability          |                 |               |         |           |              |                 |                    | 3.7    | 6.0    | 7.0    |
| Clock Rate (GHz)         | 2.6             | 2.8           | 2.5     | 2.4       | 2.4          | 2.5             | 1.4                | 0.562  | 1.328  | 1.29   |
| Num. Cores               | 16              | 20            | 24      | 28        | 40           | 40              | 68                 | 832    | 1792   | 2560   |
| Max L1 Data Cache (KB)   | 512             | 640           | 768     | 896       | 1280         | 1280            | 2176               | 1456   | 1344   | 10240  |
| Total L2 Data Cache (KB) | 4096            | 2560          | 5120    | 7168      | 40000        | 40000           | 34000              | 1536   | 4096   | 6144   |
| Total L3 Data Cache (MB) | 40              | 50            | 60      | 70        | 55           | 55              |                    |        |        |        |
| DRAM Bandwidth (GB/s)    | 97.9            | 121           | 139     | 147       | 246          | 247             | 494                | 195    | 521    | 782    |

#### 3.2.1 Overview of architectures used

In total, we created roofline models for six Intel CPUs, Intel Xeon Phis, and three NVIDIA GPUs. The CPU models roughly follow Intel's tick-tock production model and, thus, span pairs of three different instructions sets (AVX, AVX2, and AVX512) with one CPU introducing a new instruction set and the other an increase in cores and/or clock rate with the same instruction set. The Intel Xeon Phi (Knights Landing) also supports AVX512 instructions and differs from the CPUs at the highest level by an increased core count, lower clock rate, and access to MCDRAM. The three different NVIDIA GPUs span three different microarchitectures (Kepler, Pascal, and Volta), which also translates to an increased core count in the GPUs used. L1 data caches are also implemented differently across the three microarchitectures. On Kelper and Volta GPUs, the L1 cache is physically in the same memory device as CUDA "shared" memory while on Pascal GPUs the L1 cache is combined with texture memory [26], [27], [28]. Load throughput to L1 cache on Pascal GPUs achieves lower bytes/cycle compared to Kelper and Volta GPUs [29], which led to K-Athena maintaining a higher fraction of peak L1 bandwidth. An comparative overview of the technical specifications for all architectures is given in Table 2.

## 3.2.2 Roofline model

The roofline model is a graphical tool to demonstrate the theoretical peak performance of an application on an architecture by condensing the performance limits imposed by the bandwidth of each memory space and peak throughput of the device into a single plot. In a roofline model plot, peak throughputs and bandwidths of the hardware are plotted on a log Performance [FLOPS] versus log arithmetic intensity [FLOP/B] axis so that throughputs are horizontal lines and bandwidths as  $P \propto I$  lines (since bandwidth-limited  $P = B \times I$ ), where P [FLOPS] is performance<sup>5</sup>, I [FLOP/B] is arithmetic intensity (the operations executed per byte read and written), and B [B/s] is the bandwidth. The arithmetic intensities of each memory space for a specific application

appear as vertical lines, extending up where the bandwidth of the memory space limits performance.

The maximum theoretical performance of an application is limited by the bandwidth and throughput ceilings displayed in the roofline model. For the given device and application, the maximum obtainable performance in FLOPS is limited by

$$P_{\max}(a, p, i) \le \min_{m \in M} \left\{ \min \left[ T_{\text{Peak}}(i), \right] \right.$$

$$\left. B(i, m) \times I(a, p, i, m) \right] \right\},$$

$$(1)$$

where  $P_{\rm max}(a,p,i)$  [FLOPS] is the maximum possible FLOPS obtainable by application a solving problem p on architectural platform i,  $T_{\rm Peak}(i)$  [FLOPS] is the peak throughput on the platform, M is all the memory spaces on the device (L1 cache, L2 cache, DRAM, etc.), and I(a,p,i,m) [FLOP/B] is the arithmetic intensity the application solving the problem on the memory space m, or the number of FLOP executed per number of bytes written and read to and from m. We can also mark the actual performance of application with a horizontal dashed line, indicating the actual average FLOPS achieved. Figures 2a and 2b show roofline models of K-ATHENA solving a  $256^3$  linear wave on an Intel Cascade Lake CPU node on NASA's Aitken and a single NVIDIA Volta V100 GPU on MSU's HPCC.

Using the roofline model, we can quantify the architectural efficiency of the K-ATHENA, or the fraction of performance achieved compared to the theoretical maximum performance of the algorithm as limited by bandwidth. In this work, we further distinguish multiple architectural efficiencies per platform as limited by the bandwidth of different memory spaces. The architectural efficiency e(a,p,i,m) of the application a solving the problem p on platform i as limited by the bandwidth of the memory space m on platform i is

$$e(a, p, i) = \frac{\varepsilon(a, p, i)}{\min\left(T_{\text{Peak}}(i), B(i, m) \times I(a, p, i, m)\right)} \quad \text{(2)}$$

where  $\varepsilon(a,p,i)$  is the achieved performance of the application a for solving the problem p on the platform i, B(i,m) is the peak DRAM bandwidth on the platform, and I(a,p,i,m) is the arithmetic intensity of the for solving the

<sup>5.</sup> In this work we consider double precision throughput and count FMA instructions as two FLOP on architectures that support it.



(a) Cascade Lake CPU Roofline



Fig. 2. Roofline models of a 2 socket Intel Xeon Gold 6248 "Cascade Lake" CPU node on NASA's Aitken (2a) and a single NVIDIA Tesla V100 "Volta" GPU on MSU HPCC (2b). Theoretical L1 and DRAM bandwidths and theoretical peak throughputs according to manufacturer specifications are shown in dashed line. for For both cases shown here and all other architectures we tested, DRAM bandwidth (or MCDRAM bandwidth for KNLs) is the limiting bandwidth for K-ATHENA's performance.

problem on that platform. For example, on Summit's Volta V100s, K-ATHENA achieves 0.82 TFLOPS while the DRAM bandwidth limits performance to 1.13 TFLOPS, giving to a 72.5% architectural performance as limited by DRAM bandwidth.

Although bandwidths and throughputs can be obtained from vendor specifications and arithmetic intensities can be computed by hand, empirical testing more accurately reflects the actual performance. Acquiring these metrics requires a variety of performance profiling tools on the different architectures and machines. For gathering the bandwidths and throughputs on GPUs, we used GPUMEMBENCH [30] for measuring the L1 bandwidth and the Empirical Roofline Tool (Version 1.1.0) [31] for measuring all other bandwidths and the peak throughput. For computing arithmetic intensities on GPUs,

we used NVIDIA's NVPROF (CUDA Toolkit 9.2.88 on MSU HPCC, 9.2.148 on SDSC Comet) to measure memory usage to calculate arithmetic intensities and total FLOP count to estimate FLOP per finite volume cell update. To measure memory usage of the different caches, we specifically measured total memory transactions from global memory to the SMs (gld\_transactions and gst\_transactions, as a rough proxy for L1 usage), transactions to and from L2 cache (12\_read\_transactions and 12\_write\_transactions), and transactions to and from DRAM/HBM (dram\_read\_transactions and dram\_write\_transactions). Since we do not use atomic memory operations, texture memory, or shared memory, we measured zero transactions from these memory spaces. For Intel CPUs and KNLs, we used Intel Advisor's (version 2019 update 5) built-in hierarchical roofline gathering tools to collect memory bandwidths, throughputs, and arithmetic intensities [32] using the arithmetic intensity from the cacheaware roofline model for the roofline of the highest memory level. For both CPUs and GPUs, we use total memory transactions to cores and SMs as a surrogate for L1 cache usage due to limitations in the memory transaction metrics available. Although some of the memory transactions may not be through L1 cache, in a best case performance scenario the memory transactions to the registers are limited by the fastest cache bandwidth, which is the L1 cache bandwidth.

We used a 3D linear wave on a  $256^3$  cell grid for benchmarking K-ATHENA's performance and arithmetic intensities for the roofline model. Our metric for CPU machines are for two sockets on a node while the metric for KNLs and GPUs are for a single device, or a single GK210 chip for the Tesla K80. In all cases we found that K-ATHENA's performance is limited by the main memory space that accommodates the data for a single MPI task. For GPUs, this is on device DRAM/HBM, for CPUs this is the DDR3/DDR4 DRAM, and for KNLs this was the MCDRAM. This result is expected, since the finite volume MHD method in K-ATHENA is implemented as a series of simple triple or quadruple for-loop kernels that loop over the data in a task without explicitly caching data. Since the data can only fit in its entirety in DRAM, it must be loaded from and written to DRAM within each kernel. Future improvements can be made to K-ATHENA to explicitly cache data in smaller 1D arrays and kept in higher level caches. This would raise the DRAM arithmetic intensity and facilitate faster throughput [33]. Similar improvements have already been implemented upstream in ATHENA++. A more complete solution would involve fusing consecutive kernels into one kernel to reduce DRAM accesses. Given the virtually identical performance between ATHENA++ and K-ATHENA on CPUs (cf. 3.3.1) we expect the roofline model of ATHENA++ to be practically indistinguishable from K-ATHENA on non-GPU platforms.

# 3.2.3 Performance portability metric

Performance portability is at present nebulously defined. It is generally held that a performance portable application can execute wide variety of architectures and achieve acceptable performance, preferably maintaining a single code base for all architectures. In order to make valid comparisons between codes, an objective metric of performance portability is needed.

The metric proposed by [24] quantifies performance portability by the harmonic sum of the performance achieved on each platform, so that

$$P(a, p, H) = \begin{cases} \frac{|H|}{\sum_{i \in H} \frac{1}{e(a, p, i)}} & \text{if } i \text{ is supported } \forall i \in H \\ 0 & \text{otherwise} \end{cases}$$
(3)

where H is the space of all relevant platforms and e(a, p, i)is the performance efficiency of application a to solve the problem p on a platform i. If an application does not support a platform, then it is not performance portable across the platforms and is assigned a metric of 0. The performance efficiency can also be defined as either the application efficiency, the fraction of the performance of the fastest application that can solve the problem on the platform; or as the architectural efficiency, the achieved fraction of the theoretical peak performance limited by the hardware that we computed in Sec. 3.2.2. Since we did not have MHD codes implementing the same method as K-ATHENA on all architectures, we used the architectural efficiencies obtained from the roofline model to compute the performance portability metric. For completeness, we considered the architectural efficiencies as limited by the both the L1 cache and DRAM bandwidths to compute separate performance portability metric against both memory spaces.



Fig. 3. Performance Portability plot of several CPU and GPU machines with different architectures. Individual bars show the performance of K-ATHENA compared to the theoretical peak performance limited by the empirically measured DRAM and L1 bandwidths. Black bars with diamonds denote the theoretical performance limited by the manufacturer reported bandwidths. The performance portability metrics across all architectures for DRAM and L1 are shown with horizontal orange lines where solid orange used the empirically measured bandwidths and dashed orange uses manufacturer reported bandwidths.<sup>7</sup>

In Fig. 3, the architectural efficiencies as measured against the DRAM bandwidth and L1 cache bandwidth are shown with the computed performance portability metrics. K-ATHENA achieved 62.8% DRAM performance portability

7. The high L1 efficiency on the NVIDIA Tesla Pascal P100 is due to a lower obtainable bytes loaded to L1 per cycle compared to the Kepler and Volta GPUs [29], [34]. The lower L1 cache performance makes it easier to obtain a higher efficiency.

and 7.7% L1 cache performance portability, measured across a number of CPU and GPU architectures. In general, K-ATHENA achieved higher efficiencies on newer architectures.

## 3.3 Scaling

## 3.3.1 Single CPU and GPU performance



Fig. 4. Raw performance for double precision MHD (algorithm described in Sec. 3) of K-ATHENA, ATHENA++, and GAMER on a single GPU (left) or CPU (right) for varying problem sizes. Volta refers to an NVIDIA V100 GPU, Pascal refers to an NVIDIA P100 GPU, BDW (Broadwell) refers to a 14-core Xeon E5-2680 CPU, and SKX (Skylake) refers to a 20-core Xeon Gold 6148 CPU. The GAMER numbers were reported in [35] for the same algorithm used here.

In order to compare the degree to which the refactoring of ATHENA++ affected performance we first compare ATHENA++ and K-ATHENA on a single CPU. The right panel of Fig. 4 shows the cell-updates/s achieved on an Intel Broadwell and an Intel Skylake CPU for both codes for varying problem size. Overall, the achieved cell-updates/s are practically independent of problem sizes reaching  $\approx 8 \times 10^6$ on a single Broadwell CPU and  $\approx 1.4 \times 10^7$  on a single Skylake CPU. Moreover, without any additional performance optimizations (see discussion in Sec. 4), K-ATHENA is virtually on par with ATHENA++, reaching 93% or more of the original performance. For comparison, we also show the results of GAMER [35]. It is another recent (astrophysical) MHD code with support for CPU and (CUDA-based) GPU accelerated calculations and has directly been compared to ATHENA++ in [35]. We also find that ATHENA++ (and thus K-ATHENA) is about 1.5 times faster than GAMER on the same CPU.

A slightly smaller difference (factor of  $\approx 1.25$ ) is observed when comparing results for GPU runs as shown in the left panel of Fig. 4. On a P100 Pascal GPU, K-ATHENA is about 1.3 times faster than GAMER, suggesting that the difference in performance is related to the fundamental code design and not related to the implementation of specific computing kernels. On a single V100 Volta GPU, K-ATHENA reaches a peak performance of greater than  $10^8$  cell-updates/s for large problem sizes. In general, the achieved performance in cell-updates/s is strongly dependent on the problem size. For small grids the performance

is more than one order of magnitude lower than what is achieved for the largest permissible grid sizes that still fit into GPU memory. The plateau in performance on GPUs at larger grid sizes is due to DRAM bandwidth impeding K-ATHENA's performance, as discussed in Section 3.2.2.

#### 3.3.2 Weak scaling

Weak scaling results (using the same test problem and algorithm as in Sec. 3.3.1) for K-ATHENA and the original ATHENA++ version on different systems and architectures are shown in Fig. 5. Note that the chosen problem setup (using a single meshblock per MPI process) is effectively not making use of of the asynchronous communication capabilities to allow for overlapping computation and communication.

Overall, the differences between K-ATHENA and ATHENA++ on CPUs and Xeon Phis are marginal. This is expected as K-ATHENA employed simd-for loops for all kernels that are similar to the ones already in ATHENA++. Therefore, the parallel efficiency is also almost identical between both codes, reaching  $\approx 80\%$  on NASA's Electra system with Skylake CPUs (first column in Fig. 5) and  $\approx 70\%$  on ALCF's Theta system with Knights Landing Xeon Phis (second column in Fig. 5) at 2,048 nodes each. Using multiple hyperthreads per core on Theta has no significant influence on the results given the intrinsic variations observed on that system8.

The first major difference is observed on OLCF's Titan (third column in Fig. 5), where results for K-ATHENA on GPUs are included. While the parallel efficiency for both codes remains at 94% up to 8,192 nodes using only CPUs, it drops to 72% when using GPUs with K-ATHENA. However, the majority of loss in parallel efficiency already occurs going from 1 to 8 nodes using GPUs and afterwards remains almost flat. This behavior is equally present for CPU runs but less visible due to the higher parallel efficiency in general. The differences in parallel efficiency between CPU and GPU runs can be attributed to the vastly different raw performance of each architecture. On a single node the single Kepler K20X GPU is about 7 times faster than the 16-core AMD Opteron CPU. Given that the interconnect is identical for GPU and CPU communication, the effective ratio of computation to communication is worse for GPUs. Despite the worse parallel efficiency on GPUs the raw pernode performance using GPUs is still about 5.5 times faster than using CPUs at 8,192 nodes, which is overall comparable to the ratio of theoretical peak performances in both FLOPS and DRAM bandwidth.

K-ATHENA on OLCF's Summit system (last column in Fig. 5) with six Volta V100 GPUs and two 21-core POWER9 CPUs exhibits a GPU weak scaling behavior similar to the one observed on Titan. Going from 1 to 8 nodes results in a loss of 15% and afterwards the parallel efficiency remains almost flat to 76% on 4,096 nodes. The CPU weak scaling results for both codes using CPUs reveal properties of the interconnect. The weak scaling is almost perfect up to 256 nodes using 1 hyperthread per core and afterwards rapidly

8. According to the ALCF support staff, system variability contributes around 10% to the fluctuations in performance between identical runs

plummets. Using 2 hyperthreads per core (i.e., doubling the number of threads making MPI calls and doubling the number of MPI messages sent and received, as described in Sec. 2.2) the steep drop in parallel efficiency is already observed beyond 128 nodes. No such drop is observed using GPUs, which perform 42/6=7 times fewer MPI calls (compared to using 1 hyperthread per core) with larger message sizes in general.

Naturally, this is tightly related to the existing communication pattern in ATHENA++, i.e., coarse grained threading over meshblocks with each thread performing one-sided MPI calls. Without making additional changes to the code base, we can evaluate the effect of reducing the number of MPI calls for a fixed problem size in a multithreaded CPU setup using KOKKOS nested parallelism in K-ATHENA. More specifically, we use the triple nested construct illustrated in Listing 4 allowing multiple threads handling a single meshblock. As a proof of concept, the results for using using 1 MPI process per 2 cores each with one thread are shown in the purple dash line in the last column of Fig. 5. While the raw performance on a single node is slightly lower (about 16%), the improved communication pattern results in a higher overall performance for > 1,024 nodes. Similarly, the sharp drop in parallel efficiency has been shifted to first occur at 2,048 nodes.

At the single node level the six GPUs on Summit are tightly connected via NVLink. The weak scaling efficiency from one GPU to six GPUs on a single node is  $\approx 99\%$  (cf.,  $>6\times10^8$  cell-updates/s/node for a single node in the top right panel of Fig. 5). In addition, the host interconnect has a lower bandwidth and higher latency compared to NVLink. Thus, the intra-node parallel overhead is generally negligible in our analysis.

Finally, the raw per-node performance is overall comparable between Intel Skylake CPUs, Intel Knight Landing Xeon Phis, IBM POWER9 CPUs, and a single NVIDIA Kepler GPU, ranging between  $\approx 1.5-3\times 10^7$  cell-updates/s/node. The latest NVIDIA Volta GPU is a notable exception, reaching more than  $10^8$  cell-updates/s/GPU. This performance, in combination with six GPUs per node on Summit and a high parallel efficiency, results in a total performance of  $1.94\times 10^{12}$  cell-updates/s on 4,096 nodes.

## 3.4 Strong scaling

Strong scaling results for K-ATHENA on Summit on both CPUs and GPUs are shown in Fig. 6 (same test problem and algorithm as in Sec. 3.3.1). Overall, strong scaling in terms of parallel efficiency is better on CPUs than on GPUs. For example, for a  $1,408^3$  domain the parallel efficiency using CPUs remains > 83% going from 32 to 512 nodes whereas it drops to 45% for the similar GPU case  $(1,536^3)$ domain using 36 to 576 nodes). This is easily explained by comparing to the single CPU/GPU performance discussed in Sec. 3.3.1, which effectively corresponds to on-node strong scaling. The more pronounced decrease in parallel efficiency on the GPUs is a direct result of the decreased raw performance of GPUs with smaller problem sizes per GPU. The increased communication overhead of the strong scaling test plays only a secondary role. Therefore, the strong scaling efficiency of K-ATHENA in comparison to



Fig. 5. Weak scaling for double precision MHD (exact algorithm described in Sec. 3) on different supercomputers and architectures for K-ATHENA and the original ATHENA++ version. Numbers correspond to the 80th percentile of individual cycle performances of several runs in order to reduce effects of network variability. The top row shows the raw performance in number of cell-updates per second per node and can directly be compared between different system and architectures. The bottom row shows the parallel efficiency normalized to the individual single node performance. The first column contains results for a workload of  $64^3$  and  $128^3$  cells per core on NASA's Electra system using two 20-core Intel Xeon Gold 6148 processors per node. The second column shows results for a workload of  $64^3$  per core on ALCF's Theta system with one 64-core Intel Xeon Phi 7230 (Knights Landing) per node. HT-1, HT-2, and HT-4 refers to using 1, 2, and 4 hyperthreads per core, respectively. The third column shows results for a workload of  $128^3$  per CPU core and  $192^3$  per GPU on OLCF's Titan system with one AMD Opteron 6274 16-core CPU and one NVIDIA K20X (Kepler) GPU per node. The last column contains results for a workload of  $64^3$  per CPU core and  $256^3$  per GPU on OLCF's Summit system with two 21-core IBM POWER9 CPUs and six NVIDIA V100 (Volta) GPUs per node. On all systems the GPU runs used  $10^3$  loops and the CPU runs used  $10^3$  loops with the the exception of the dashed purple line on Summit that used Kokkos nested parallelism, see Sec. 2.3 for more details.

ATHENA++ is expected to be identical. Moreover, additional performance improvements, as discussed in the following Section, will greatly benefit the strong scaling behavior of GPUs in general. Nevertheless, the raw performance of the GPUs still outperforms CPUs by a large multiple despite the worse strong scaling parallel efficiency. For example, in the case discussed above on Summit, the per-node performance of GPUs over CPUs is still about 14 times higher at >512 nodes.

# 4 CURRENT LIMITATIONS AND FUTURE ENHANCE-MENTS

Our primary goal for the current version of K-ATHENA was to make GPU-accelerated simulations possible while maintaining CPU performance, and to do so with the smallest amount of code changes necessary. Naturally, this resulted in several trade-offs and leaves room for further (performance) improvements in the future.

For example, we are currently not making use of the memory hierarchy abstraction provided by KOKKOS. This includes more advanced hardware features such as scratch spaces on GPUs. Scratch space can be shared among threads of a TeamPolicy and allows for efficient reuse of memory. We could use scratch space to reduce the number of reads from DRAM in stenciled kernels (like the fluid solver's reconstruction step). We could also fuse consecutive kernels to further reduce reads and writes to DRAM, although this would also increase register and possibly spill store usage. Moreover, complex kernels such as a Riemann solver could be broken down further by using TeamThreadRanges and ThreadVectorRanges structures that are closer to the structure of the algorithm. This is in contrast to our current approach where all kernels are treated equally, with the same execution policies independent of the individual algorithms within the kernels. The Riemann solver could also be split into separate kernels to reduce the number of registers needed, eliminate the use of spill stores on the GPU, and allow higher occupancy on the GPU.



Fig. 6. Strong parallel scaling for double precision MHD (algorithm described in Sec. 3) of K-ATHENA on NVIDIA V100 GPUs (6 GPUs per node; green solid lines) and IBM Power 9 CPUs (42 cores per node; orange/red dash dotted lines) on Summit. The top panel shows the raw performance in cell-updates per second per node and the bottom panel shows the parallel efficiency. The effective workload per GPU goes from  $256^3$  to  $64^3$  for the  $1,536^3$  domain and from  $256^3$  to  $128^3$  for the  $3072^3$  domain. In the CPU case the effective workload per single Power9 CPU (21 cores) goes from  $353^3$  to  $88^3$  for the  $1,408^3$  domain and from  $353^3$  to  $177^3$  for the  $2,944^3$  domain. The resulting effective workloads per node are comparable (within few percent) between GPU and CPU runs.

Similarly, on CPUs and Xeon Phis we are currently not using a KOKKOS parallel execution pattern. The macro we introduced to easily exchange parallel patterns replaces the parallel region on CPUs and Xeon Phis with a simple nested for loop including a simd pragma, as shown in Listing 1. This is required for maximum performance as the implicit #pragma ivdep hidden in the KOKKOS templates is less aggressive than the explicit #pragma omp simd with respect to automated vectorization. We reported this issue and future KOKKOS updates will address this by either providing an explicit tightly nested vectorized loop pattern and/or adding support for a simd property to the execution policy template.

Another possible future improvement is an increase in parallel efficiency by overlapping communication and computation. While ATHENA++ is already built for asynchronous communication through one-sided MPI calls and a task based execution model, more fine-grained optimizations are possible. For example, spatial dimensions in the variable reconstruction step that occurs after the exchange

of boundary information could be split, so that the kernel in the first dimension could run while the boundary information of the second and third dimension are still being exchanged. In addition, the next major KOKKOS release will contain more support for architecture-dependent task based execution and, for example, will allow for the transparent use of CUDA streams.

CUDA streams may also help in addressing another current limitation of K-ATHENA on GPUs. Our minimal implementation approach currently limits all meshblocks to be allocated in a fixed memory space. This means that the total problem size that can currently be addressed with K-ATHENA is limited by the total amount of GPU memory available. An alternative approach is keeping the entire mesh in system memory, which is still several times larger than the GPU memory on most (if not all) current machines. For the execution of kernels individual meshblocks would be copied back and forth between system memory and GPU memory. Here, CUDA streams could be used to hide these expensive memory transfers as they would occur in the background while the GPU is executing different kernels. Theoretically, meshes larger than the GPU memory could already be used right now with the help of unified memory. However, given that the code is not optimized for efficient page migrations the resulting performance degradation is large (more than a factor of 10). Thus, using unified memory with meshes larger than the GPU memory is not recommended.

#### 5 CONCLUSIONS

We presented K-ATHENA – a KOKKOS-based performance portable version of the finite volume MHD code ATHENA++. KOKKOS is a C++ template library that provides abstractions for on-node parallel regions and the memory hierarchy. Our main goal was to enable GPU-accelerated simulations while maintaining ATHENA++'s excellent CPU performance using a single code base and with minimal changes to the existing code.

Generally, four main changes were required in the refactoring process. We changed the underlying memory management in ATHENA++'s multi-dimensional array class to make transparently use of KOKKOS's equivalent multi-dimensional arrays, i.e., Kokkos::Views. We exchanged all (tightly) nested for loops with the KOKKOS equivalent parallel region, e.g., a Kokkos::parallel\_for, which are now kernels that can be launched on any supported device. We inlined all support functions (e.g., the equation of state) that are called within kernels. We changed the communication buffers to be Views so that MPI calls between GPUs buffers are directly possible without going through system memory.

With all changes in place we performed both profiling and scaling studies across different platforms, including NASA's Electra system with Intel Skylake CPUs, ALCF's Theta system with Intel Xeon Phi Knights Landing, OLCF's Titan with AMD Opteron CPUs and NVIDIA Kepler GPUs, and OLCF's Summit machine with IBM Power9 CPUs and NVIDIA Volta GPUs. Using a roofline model analysis, we demonstrated that the current implementation of the MHD algorithms is memory bound by either the DRAM, HBM, or

MCDRAM bandwidths on CPUs and GPUs. Moreover, we calculated a performance portability metric of 62.8% across Xeon Phis, and 6 CPU and 3 GPU generations.

Detailed Kokkos profiling revealed that there is currently no universal Kokkos execution policy (how a parallel region is executed) that achieves optimal performance across different architectures. For example, a one-dimensional loop with manual index matching from 1 to 3D/4D is fastest on GPUs (achieving  $>10^8$ ) double precision MHD cell-updates/s on a single NVIDIA V100 GPU) whereas tightly nested for loops with simd directives are fastest on CPUs. This is primarily a result of Kokkos's specific implementation details and expected to improve in future releases through more flexible execution policies.

Strong scaling on GPUs is currently predominately limited by individual GPU performance and not by communication. In other words, insufficient GPU utilization outweighs additional performance overhead with decreasing problem size per GPU.

Weak scaling is generally good, with parallel efficiencies of 80% and higher for more than 1,000 nodes across all machines tested. Notably, on Summit K-ATHENA achieves a total calculation speed of  $1.94\times10^{12}$  cell-updates/s on 24,567 V100 GPUs at a speedup of 30 compared to using the available 172,032 CPU cores.

Finally, there is still a great deal of untapped potential left, e.g., using more advanced hardware features such as fine-grained nested parallelism, scratch pad memory (i.e., fast memory that can be shared among threads), or CUDA streams. These are currently being addressed within the new PARTHENON collaboration (https://github.com/lanl/parthenon), which is developing a performance portable adaptive mesh refinement framework based on the results presented here.

Nevertheless, we achieved our primary performance portability goal of enabling GPU-accelerated simulations while maintaining CPU performance using a single code base. Moreover, we consider the current results highly encouraging and will continue with further development on the project's GitLab repository at <a href="https://gitlab.com/pgrete/kathena">https://gitlab.com/pgrete/kathena</a>. Contributions of any kind are welcome!

## **ACKNOWLEDGMENTS**

The authors would like to thank the KOKKOS developers, particularly Christian Trott and Steve Bova, and the organizers of the 2018 Performance Portability with KOKKOS Bootcamp for their help using KOKKOS in ATHENA++. We thank Kevin O'Leary and Dunni Aribuki for support with the Intel Advisor. We thank Kristian Beckwith for inspiring discussions on KOKKOS. We thank the ATHENA++ team for making their code public and for their well designed code. We acknowledge funding by NASA Astrophysics Theory Program grant #NNX15AP39G. Code development, testing, and benchmarking was made possible through various computing grants including allocations on NASA Pleiades (SMD-16-7720), OLCF Titan (AST133), OLCF Summit (AST146), ALCF Theta (athena\_performance), XSEDE Comet (TG-AST090040), and Michgian State University's High Performance Computing Center.

#### REFERENCES

- [1] Y. Zheng, A. Kamil, M. B. Driscoll, H. Shan, and K. Yelick, "Upc++: A pgas extension for c++," in 2014 IEEE 28th International Parallel and Distributed Processing Symposium, May 2014, pp. 1105–1114.
- [2] L. V. Kale and S. Krishnan, "Charm++: A portable concurrent object oriented system based on c++," Champaign, IL, USA, Tech. Rep., 1993.
- [3] M. Bauer, S. Treichler, E. Slaughter, and A. Aiken, "Legion: Expressing locality and independence with logical regions," in *Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis*, ser. SC '12. Los Alamitos, CA, USA: IEEE Computer Society Press, 2012, pp. 66:1–66:11. [Online]. Available: http://dl.acm.org/citation.cfm?id=2388996.2389086
- [4] C. J. White, J. M. Stone, and C. F. Gammie, "An extension of the athena++ code framework for grmhd based on advanced riemann solvers and staggered-mesh constrained transport," *The Astrophysical Journal Supplement Series*, vol. 225, no. 2, p. 22, 2016. [Online]. Available: http://stacks.iop.org/0067-0049/225/i=2/a=
- [5] J. M. Stone, K. Tomida, C. J. White, and K. G. Felker, "The athena++ adaptive mesh refinement framework: Design and magnetohydrodynamic solvers," 2020.
- [6] T. P. Straatsma, K. B. Antypas, and T. J. Williams, Exascale Scientific Applications: Scalability and Performance Portability, 1st ed. Chapman & Hall/CRC, 2017.
- [7] J. C. Bennett, G. M. Baker, M. T. Bettencourt, S. W. Bova, K. Franko, M. Gamell, R. Grant, S. D. Hammond, D. S. Hollman, S. Knight, H. Kolla, P. Lin, S. L. Olivier, G. D. Sjaardema, N. L. Slattengren, K. Teranishi, J. J. Wilke, R. L. Clay, L. Kale, N. Jain, E. Mikida, A. Aiken, M. Bauer, W. Lee, E. Slaughter, S. Treichler, M. Berzins, T. Harman, A. Humphreys, J. Schmidt, D. Sunderland, P. Mccormick, S. Gutierrez, M. Shulz, T. Gamblin, and P. T. Bremer, "Asc atdm level 2 milestone #5325: Asynchronous many-task runtime system analysis and assessment for next generation platforms," 9 2015.
- [8] M. A. Heroux, D. W. Doerfler, P. S. Crozier, H. K. Thornquist, R. W. Numrich, A. B. Williams, H. C. Edwards, E. R. Keiter, M. Rajan, and J. M. Willenbring, "Improving performance via mini-applications." 9 2009.
- [9] M. Martineau, S. McIntosh-Smith, and W. Gaudin, "Assessing the performance portability of modern parallel programming models using tealeaf," Concurrency and Computation: Practice and Experience, vol. 29, no. 15, p. e4117, 2017, e4117 cpe.4117. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe. 4117
- [10] T. Deakin, J. Price, M. Martineau, and S. McIntosh-Smith, "Evaluating attainable memory bandwidth of parallel programming models via babelstream," *International Journal of Computational Science and Engineering*, vol. 17, no. 3, pp. 247–262, 2018. [Online]. Available: https://www.inderscienceonline.com/ doi/abs/10.1504/IJCSE.2018.095847
- [11] J. R. Hammond and T. G. Mattson, "Evaluating data parallelism in c++ using the parallel research kernels," in *Proceedings of the International Workshop on OpenCL*, ser. IWOCL19. New York, NY, USA: Association for Computing Machinery, 2019. [Online]. Available: https://doi.org/10.1145/3318170.3318192
- [12] M. A. Heroux and J. M. Willenbring, "A new overview of the trilinos project," *Scientific Programming*, vol. 20, no. 2, 1 2012.
- [13] J. K. Holmen, B. Peterson, and M. Berzins, "An approach for indirectly adopting a performance portability layer in large legacy codes," in 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), Nov 2019, pp. 36–49.
- [14] V. Artigues, K. Kormann, M. Rampp, and K. Reuter, "Evaluation of performance portability frameworks for the implementation of a particle-in-cell code," arXiv e-prints, p. arXiv:1911.08394, Nov 2019.
- [15] L. Dagum and R. Menon, "Openmp: An industry-standard api for shared-memory programming," *IEEE Comput. Sci. Eng.*, vol. 5, no. 1, pp. 46–55, Jan. 1998. [Online]. Available: https://doi.org/10.1109/99.660313
- [16] J. É. Stone, D. Gohara, and G. Shi, "Opencl: A parallel programming standard for heterogeneous computing systems," Computing in Science Engineering, vol. 12, no. 3, pp. 66–73, May 2010.
  [17] H. C. Edwards, C. R. Trott, and D. Sunderland, "Kokkos:
- [17] H. C. Edwards, C. R. Trott, and D. Sunderland, "Kokkos: Enabling manycore performance portability through polymorphic memory access patterns," Journal of Parallel and Distributed

- Computing, vol. 74, no. 12, pp. 3202 3216, 2014, domain-Specific Languages and High-Level Frameworks for High-Performance Computing. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0743731514001257
- [18] R. Hornung, H. Jones, J. Keasler, R. Neely, O. Pearce, S. Hammond, C. Trott, P. Lin, C. Vaughan, J. Cook, R. Hoekstra, B. Bergen, J. Payne, and G. Womeldorff, "ASC Tri-lab Co-design Level 2Milestone Report 2015," LLNL, Tech Report LLNL-TR-677453, 2015. [Online]. Available: https://e-reports-ext.llnl.gov/pdf/800854.pdf
- [19] M. A. Heroux, R. A. Bartlett, V. E. Howle, R. J. Hoekstra, J. J. Hu, T. G. Kolda, R. B. Lehoucq, K. R. Long, R. P. Pawlowski, E. T. Phipps, A. G. Salinger, H. K. Thornquist, R. S. Tuminaro, J. M. Willenbring, A. Williams, and K. S. Stanley, "An overview of the trilinos project," ACM Trans. Math. Softw., vol. 31, no. 3, pp. 397–423, Sep. 2005. [Online]. Available: http://doi.acm.org/10.1145/1089014.1089021
- [20] J. K. Holmen, A. Humphrey, D. Sunderland, and M. Berzins, "Improving uintah's scalability through the use of portable kokkos-based data parallel tasks," in *Proceedings of the Practice and Experience in Advanced Research Computing* 2017 on Sustainability, Success and Impact, ser. PEARC17. New York, NY, USA: ACM, 2017, pp. 27:1–27:8. [Online]. Available: http://doi.acm.org/10.1145/3093338.3093388
- [21] J. M. Stone, T. A. Gardiner, P. Teuben, J. F. Hawley, and J. B. Simon, "Athena: A new code for astrophysical mhd," *The Astrophysical Journal Supplement Series*, vol. 178, no. 1, p. 137, 2008. [Online]. Available: http://stacks.iop.org/0067-0049/178/i=1/a=137
- [22] J. M. Stone and T. Gardiner, "A simple unsplit godunov method for multidimensional mhd," New Astronomy, vol. 14, no. 2, pp. 139 – 148, 2009. [Online]. Available: http://www.sciencedirect.com/ science/article/pii/S1384107608000754
- [23] S. Williams, A. Waterman, and D. Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures," Commun. ACM, vol. 52, no. 4, pp. 65–76, Apr. 2009.
- [24] S. Pennycook, J. Sewall, and V. Lee, "Implications of a metric for performance portability," *Future Generation Computer Systems*, vol. 92, pp. 947 958, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167739X17300559
- [25] Intel Corporation. (2016) Intel 64 and IA-32 architectures optimization reference manual. [Online]. Available: https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
- [26] NVIDIA Corporation. (2014) NVIDIAs next generation CUDA compute architecture: Kepler GK110/210. [Online]. Available: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-product-literature/ NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf
- [27] —. (2016) NVIDIA tesla P100. [Online]. Available: https://images.nvidia.com/content/pdf/tesla/ whitepaper/pascal-architecture-whitepaper.pdf
- [28] —. (2017) NVIDIA tesla V100 GPU architecture. [Online]. Available: https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
- [29] Z. Jia, M. Maggioni, B. Staiger, and D. P. Scarpazza, "Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking," arXiv:1804.06826 [cs], Apr. 2018, arXiv: 1804.06826. [Online]. Available: http://arxiv.org/abs/1804.06826
- [30] E. Konstantinidis and Y. Cotronis, "A quantitative roofline model for GPU kernel performance estimation using micro-benchmarks and hardware metric profiling," *Journal of Parallel and Distributed Computing*, vol. 107, pp. 37–56, Sep. 2017.
- [31] Y. J. Lo, S. Williams, B. Van Straalen, T. J. Ligocki, M. J. Cordery, N. J. Wright, M. W. Hall, and L. Oliker, "Roofline Model Toolkit: A Practical Tool for Architectural and Program Analysis," in High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, S. A. Jarvis, S. A. Wright, and S. D. Hammond, Eds. Springer International Publishing, 2015, pp. 129–148.
- [32] D. Marques, H. Duarte, A. Ilic, L. Sousa, R. Belenov, P. Thierry, and Z. A. Matveev, "Performance Analysis with Cache-Aware Roofline Model in Intel Advisor," in 2017 International Conference on High Performance Computing Simulation (HPCS), Jul. 2017, pp. 898–907, iSSN: null.
- [33] F. W. Glines, M. Anderson, and D. Neilsen, "Scalable Relativistic High-Resolution Shock-Capturing for Heterogeneous Computing," in 2015 IEEE International Conference on Cluster Computing, Sep. 2015, pp. 611–618, iSSN: 2168-9253.

- [34] Z. Jia, M. Maggioni, J. Smith, and D. P. Scarpazza, "Dissecting the NVidia Turing T4 GPU via Microbenchmarking," arXiv:1903.07486 [cs], Mar. 2019, arXiv: 1903.07486. [Online]. Available: http://arxiv.org/abs/1903.07486
- [35] U.-H. Zhang, H.-Y. Schive, and T. Chiueh, "Magnetohydrodynamics with gamer," *The Astrophysical Journal Supplement Series*, vol. 236, no. 2, p. 50, 2018. [Online]. Available: http://stacks.iop.org/0067-0049/236/i=2/a=50



Philipp Grete After receiving a B.Sc. in Computer Science in 2008 from the University of Cooperative Education in Stuttgart, Germany, Philipp Grete worked for Hewlett-Packard before studying Physics (B.Sc.) and Computer Science (M.Sc.) at the University of Gttingen, Germany, from 2010 to 2013. In his Ph.D. thesis (2014-2016, University of Gttingen, Germany) he worked on subgrid-scale modeling of compressible magnetohydrodynamic turbulence. Since October 2016, he is a postdoctoral

research associate at Michigan State University. His current research interest include fundamental processes in magnetized astrophysical fluids, numerical methods in computational fluid dynamics, and high-performance computing with an emphasis on performance portability.



Forrest W. Glines Forrest Glines is PhD student at Michigan State University working towards a dual degree in Astrophysics and Computational Mathematics, Science, and Engineering. He received his B.S. in Physics from Brigham Young University in 2016, publishing work on magnetohydrodynamics methods using GPUs. In the 2015 and 2016 he also spent 6 months collectively at Los Alamos National Laboratory, simulating galaxy cluster and developing simulations coupling radiative transfer and hydrodynamics.

Forrest's current research interests includes the coupling of galaxy clusters with active galactic nuclei, plasma turbulence, the evolution of galaxies with magnetic fields, and developing simulations for the exascale era.



**Brian W. O'Shea** Brian O'Shea received his PhD in Physics from the University of Illinois at Urbana-Champaign in 2005. His dissertation research focused on simulations of cosmological structure formation, particularly in the first generation of stars in the Universe. He was then a Directors Postdoctoral Fellow at Los Alamos National Laboratory in both the Theoretical Astrophysics Group and the Applied Physics Division, where he studied the formation of galaxies over a wide range of mass scales and cosmic epochs.

Since 2008 he has been a professor at Michigan State University with appointments in the Department of Computational Mathematics, Science, and Engineering, Department of Physics and Astronomy, and National Superconducting Cyclotron Laboratory. His current research interests range across a wide variety of topics in cosmological structure formation, astrophysical plasma turbulence, high-performance computing, and computational science education. He is one of the core developers and leaders of the Enzo community code (enzo-project.org), and a contributor to several other open-source projects.